


BMFM-DNA: A SNP-aware DNA foundation model to capture variant effects

Li, Hongyang, Dey, Sanjoy, Kwon, Bum Chul, Danziger, Michael, Rosen-Tzvi, Michal, Hu, Jianying, Kozloski, James, Tsou, Ching-Huei, Dandala, Bharath, Meyer, Pablo

arXiv.org Artificial Intelligence

Large language models (LLMs) trained on text have demonstrated remarkable results on natural language processing (NLP) tasks. These models have been adapted to decipher the language of DNA, where sequences of nucleotides act as "words" that encode genomic functions. However, the genome differs fundamentally from natural language, as it lacks clearly defined words or a consistent grammar. Although DNA language models (DNALMs) such as DNABERT and GENA-LM have achieved a high level of performance on genome-related biological tasks, these models do not encode biological functions in the presence of sequence variations. To address this problem, we pre-train foundation models that effectively integrate sequence variations, in particular Single Nucleotide Polymorphisms (SNPs), as they underlie important biological functions. Specifically, we use ModernBERT to pre-train two different Biomedical Foundation Models (BMFM): BMFM-DNA-REF, in which the model is trained with sequences of varying lengths, along with their reverse complements, derived from the reference genome; and BMFM-DNA-SNP, in which the model is trained with sequences created using a novel representation scheme that encodes sequence variations. Our findings indicate that integrating sequence variations into DNALMs helps capture biological functions, as seen in improvements on all fine-tuning tasks. To explore the model's practical utility, we experimented with various strategies for SNP imputation on the promoter detection task introduced in DNABERT-2. However, we acknowledge that the current benchmarks are limited in their ability to fully evaluate these models. To enable more comprehensive assessment in the future and encourage community contributions, we release our models through HuggingFace and the code to reproduce the results at https://github.com/BiomedSciAI/biomed-multi-omic
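One plausible way to fold SNPs into a plain nucleotide sequence is to replace each variant position with the standard IUPAC ambiguity code covering both alleles. This is only an illustrative sketch; the abstract does not describe BMFM-DNA-SNP's actual representation scheme, and the `encode_snps` helper below is hypothetical:

```python
# Sketch: marking SNPs in a DNA string with IUPAC ambiguity codes.
# This is an assumed encoding for illustration, not necessarily the
# representation used by BMFM-DNA-SNP.

# IUPAC codes for two-allele ambiguity
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
}

def encode_snps(ref: str, snps: dict[int, str]) -> str:
    """Replace each SNP position with the IUPAC code that covers
    both the reference base and the alternate allele."""
    seq = list(ref)
    for pos, alt in snps.items():
        seq[pos] = IUPAC[frozenset({seq[pos], alt})]
    return "".join(seq)

# Reference ACGTAC with a C->T SNP at position 1 and an A->G SNP at 4
encoded = encode_snps("ACGTAC", {1: "T", 4: "G"})  # -> "AYGTRC"
```

A tokenizer trained on such strings sees variant sites as distinct symbols, so the model can learn variant-conditioned context rather than only the reference sequence.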


When repeats drive the vocabulary: a Byte-Pair Encoding analysis of T2T primate genomes

Popova, Marina, Chelombitko, Iaroslav, Komissarov, Aleksey

arXiv.org Artificial Intelligence

The emergence of telomere-to-telomere (T2T) genome assemblies has opened new avenues for comparative genomics, yet effective tokenization strategies for genomic sequences remain underexplored. In this pilot study, we apply Byte-Pair Encoding (BPE) to nine T2T primate genomes--including three human assemblies--by training independent BPE tokenizers with a fixed vocabulary of 512,000 tokens using our custom tool, dnaBPE. Our analysis reveals that only 11,569 tokens are shared across all assemblies, while 991,854 tokens are unique to a single genome, indicating a rapid decline in shared vocabulary as more assemblies are compared. Moreover, phylogenetic trees derived from token overlap failed to recapitulate established primate relationships, a discrepancy attributed to the disproportionate influence of species-specific high-copy repetitive elements. These findings underscore the dual nature of BPE tokenization: while it effectively compresses repetitive sequences, its sensitivity to high-copy elements limits its utility as a universal tool for comparative genomics. We discuss potential hybrid strategies and repeat-masking approaches to refine genomic tokenization, emphasizing the need for domain-specific adaptations in the development of large-scale genomic language models. The dnaBPE tool used in this study is open-source and available at https://github.com/aglabx/dnaBPE .
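The repeat-driven effect the abstract describes can be seen in a toy version of BPE training: merges greedily fuse the most frequent adjacent pair, so a high-copy repeat dominates the learned vocabulary. This is a minimal sketch of the standard BPE algorithm, not the dnaBPE implementation (which scales to a 512,000-token vocabulary):

```python
from collections import Counter

def train_bpe(seq: str, num_merges: int) -> list[tuple[str, str]]:
    """Greedy Byte-Pair Encoding on a DNA string: repeatedly merge the
    most frequent adjacent token pair and record the merge rules."""
    tokens = list(seq)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        # Apply the chosen merge left-to-right in one pass
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return merges

# A genome dominated by two repeats: a GATC satellite and telomeric TTAGGG.
# The very first merge already fuses a pair from the high-copy repeats.
merges = train_bpe("GATC" * 50 + "TTAGGG" * 30, 8)
```

Because merges are chosen purely by frequency, species-specific repeat families claim most of the vocabulary budget, which is exactly why token overlap between genomes drops so quickly.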


The Dire Wolf Is Back

The New Yorker

Extinction is a part of nature. Of the five billion species that have existed on Earth, 99.9 per cent have vanished. The Triassic-Jurassic extinction, two hundred million years ago, finished off the crocodile-like phytosaur. Sixty-six million years ago, the end-Cretaceous extinction eliminated the Tyrannosaurus rex and the velociraptor; rapid climate change from an asteroid impact was the likely cause. The Neanderthals disappeared some forty thousand years ago. One day--whether from climate change, another asteroid, nuclear war, or something we can't yet imagine--humans will probably be wiped out, too.


A Phylogenetic Approach to Genomic Language Modeling

Albors, Carlos, Li, Jianan Canal, Benegas, Gonzalo, Ye, Chengzhong, Song, Yun S.

arXiv.org Artificial Intelligence

Genomic language models (gLMs) have shown mostly modest success in identifying evolutionarily constrained elements in mammalian genomes. To address this issue, we introduce a novel framework for training gLMs that explicitly models nucleotide evolution on phylogenetic trees using multispecies whole-genome alignments. Our approach integrates an alignment into the loss function during training but does not require it for making predictions, thereby enhancing the model's applicability. We applied this framework to train PhyloGPN, a model that excels at predicting functionally disruptive variants from a single sequence alone and demonstrates strong transfer learning capabilities.


scFusionTTT: Single-cell transcriptomics and proteomics fusion with Test-Time Training layers

Meng, Dian, Xing, Bohao, Huang, Xinlei, Liu, Yanran, Zhou, Yijun, Xiao, Yongjun, Yu, Zitong, Zheng, Xubin

arXiv.org Artificial Intelligence

Single-cell multi-omics (scMulti-omics) refers to paired multimodal data, such as Cellular Indexing of Transcriptomes and Epitopes by Sequencing (CITE-seq), in which each cell is measured across multiple modalities, i.e., genes and proteins. scMulti-omics can reveal heterogeneity inside tumors and illuminate the distinct genetic properties of diverse cell types, which is crucial for targeted therapy. Currently, deep learning methods based on attention structures in the bioinformatics area face two challenges. The first is the vast number of genes in a single cell: traditional attention-based modules struggle to effectively leverage all gene information due to their limited capacity for long-context learning and their high computational complexity. The second is that genes in the human genome are ordered and influence each other's expression, yet most existing methods ignore this sequential information. The recently introduced Test-Time Training (TTT) layer is a novel sequence modeling approach that is particularly suitable for handling long contexts like genomics data, because it is a linear-complexity structure better suited to data with sequential relationships. In this paper, we propose scFusionTTT, a novel method for Single-Cell multimodal omics Fusion with a TTT-based masked autoencoder. Of note, we combine the order information of genes and proteins in the human genome with the TTT layer, fuse multimodal omics, and enhance unimodal omics analysis. Finally, the model employs a three-stage training strategy, which yielded the best performance across most metrics on four multimodal omics datasets and four unimodal omics datasets, demonstrating the superior performance of our model. The dataset and code will be available at https://github.com/DM0815/scFusionTTT.


In case of extinction, scientists store human genome on a 'memory crystal' that lasts billions of years

Popular Science

The disc is as tough as quartz and withstands cosmic radiation. Researchers have encoded the entire human genome onto a "5D memory crystal" on the off chance our species finds itself needing to walk back from the brink of extinction. But even if the plan ultimately fails, the device itself is theoretically capable of providing our genetic code to some other future, sentient third party, even if it takes them billions of years to find it. For over a decade, the gold standard for the most durable data storage medium has been crystal.


Scientists discover potential secret to reversing aging

Daily Mail - Science & tech

Ancient viruses, whose DNA has hitchhiked within the human genome for millennia, may be the cause of many age-related conditions. Scientists have proven for the first time that they can use this viral DNA -- known as 'retroelements' -- to predict the age of human cells with 'high accuracy.' In recent years, this seemingly inactive 'junk' DNA from retroelements has been linked to everything from sleep patterns and memory formation to bipolar disorder. Armed with their new ability to track a person's age via this ancient viral DNA, the scientists now plan to investigate whether new antiviral treatments could reverse the conditions of aging, by deactivating the worst of these viral 'retroelement' genes. The new research harnesses previously unknown features of this ancient viral DNA, creating a biological clock to track a person's age from the DNA's chemical changes.


G4-Attention: Deep Learning Model with Attention for predicting DNA G-Quadruplexes

Mukherjee, Shrimon, Pramanik, Pulakesh, Basuchowdhuri, Partha, Bhattacharya, Santanu

arXiv.org Artificial Intelligence

G-Quadruplexes are four-stranded non-canonical nucleic acid secondary structures formed by the stacked arrangement of guanine tetrads. They are involved in a wide range of biological roles because of their exceptionally distinct structural characteristics. After the completion of the human genome sequencing project, many bioinformatics algorithms were introduced to predict active G4 regions in vitro based on the canonical G4 sequence elements, G-richness and G-skewness, as well as non-canonical sequence features. Recently, sequencing techniques such as G4-seq and G4-ChIP-seq were developed to map G4s in vitro and in vivo, respectively, at a few-hundred-base resolution. Subsequently, several machine learning approaches were developed for predicting G4 regions using the existing databases. However, their prediction models were simplistic, and the prediction accuracy was notably poor. In response, we propose a novel convolutional neural network with Bi-LSTM and attention layers, named G4-attention, to predict G4-forming sequences with improved accuracy. G4-attention achieves high accuracy and attains state-of-the-art results on the G4 prediction task. Our model also predicts G4 regions accurately in highly class-imbalanced datasets. In addition, the model, trained on the human genome dataset, can be applied to any non-human DNA sequence to predict G4 formation propensities.
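The canonical sequence features the abstract mentions, G-richness and G-skewness, are simple to compute; the earlier rule-based predictors scored windows with statistics like these. This is a sketch of those baseline features only, not of the G4-attention network itself:

```python
def g_richness(seq: str) -> float:
    """Fraction of guanines in the sequence."""
    return seq.count("G") / len(seq)

def g_skewness(seq: str) -> float:
    """(G - C) / (G + C): positive values indicate a G-rich strand,
    a classic heuristic signal of G4-forming potential."""
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0

# A canonical G4 motif: four G-tracts separated by short loops
motif = "GGGATGGGTAGGGCTGGG"
rich, skew = g_richness(motif), g_skewness(motif)  # both strongly positive
```

Learned models such as G4-attention replace hand-thresholded scores like these with features extracted directly from the raw sequence, which is what lifts accuracy on the class-imbalanced G4-seq and G4-ChIP-seq benchmarks.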


Science Is Becoming Less Human

The Atlantic - Technology

This summer, a pill intended to treat a chronic, incurable lung disease entered mid-phase human trials. Previous studies have demonstrated that the drug is safe to swallow, although whether it will improve symptoms of the painful fibrosis that it targets remains unknown; this is what the current trial will determine, perhaps by next year. Such a tentative advance would hardly be newsworthy, except for a wrinkle in the medicine's genesis: It is likely the first drug fully designed by artificial intelligence to come this far in the development pipeline. The pill's maker, the biotech company Insilico Medicine, used hundreds of AI models to discover both a new target in the body that could treat the fibrosis and which molecules might be synthesized for the drug itself. Those programs allowed Insilico to go from scratch to putting this drug through the first phase of human trials in two and a half years, rather than the typical five or so.